Abstract

Object proposals have quickly become the de-facto
pre-processing step in a number of vision pipelines (for
object detection, object discovery, and other tasks). Their
performance is usually evaluated on partially annotated
datasets. In this paper, we argue that the choice of using a
partially annotated dataset for evaluation of object
proposals is problematic: as we demonstrate via a thought
experiment, the evaluation protocol is ‘gameable’, in the
sense that progress under this protocol does not necessarily
correspond to a “better” category independent object
proposal algorithm.

To alleviate this problem, we: (1) Introduce a nearly-fully
annotated version of the PASCAL VOC dataset, which serves as
a test-bed to check whether object proposal techniques are
overfitting to a particular list of categories; (2) Perform
an exhaustive evaluation of object proposal methods on our
nearly-fully annotated PASCAL dataset, together with
cross-dataset generalization experiments; and (3) Introduce
a diagnostic experiment to detect the bias capacity of an
object proposal algorithm. This tool circumvents the need
for a densely annotated dataset, which can be expensive and
cumbersome to collect. Finally, we plan to release an
easy-to-use toolbox that combines various publicly available
implementations of object proposal algorithms and
standardizes proposal generation and evaluation, so that new
methods can be added and evaluated on different datasets. We
hope that the results presented in the paper will motivate
the community to test the category independence of various
object proposal methods by carefully choosing the evaluation
protocol.

1. Introduction 

In the last few years, the Computer Vision community has 
witnessed the emergence of a new class of techniques called 
Object Proposal algorithms [1-11]. 

Object proposals are a set of candidate regions or bounding 
boxes in an image that may potentially contain an object. 

Object proposal algorithms have quickly become the de-facto
pre-processing step in a number of vision pipelines: object
detection [12-21], segmentation [22-26], object discovery
[27-30], weakly supervised learning of object-object
interactions [31, 32], content aware media retargeting [33],
action recognition in still images [34] and visual tracking
[35, 36]. Of all these tasks, object proposals have been
particularly successful in object detection systems. For
example, nearly all top-performing entries [13, 37-39] in
the ImageNet Detection Challenge 2014 [40] used object
proposals. They are preferred over the formerly used sliding
window paradigm due to their computational efficiency.
Objects present in an image may vary in location, size, and
aspect ratio. Performing an exhaustive search over such a
high dimensional space is difficult. By using object
proposals, computational effort can be focused on a small
number of candidate windows.

The focus of this paper is the protocol used for evaluating
object proposals. Let us begin by asking - what is the
purpose of an object proposal algorithm?

In early works [2, 4, 6], the emphasis was on category
independent object proposals, where the goal is to identify
instances of all objects in the image irrespective of their
category. While it can be tricky to precisely define what an
“object” is^1, these early works presented cross-category
evaluations to establish and measure category independence.

More recently, object proposals are increasingly viewed as
detection proposals [1, 8, 11, 42], where the goal is to
improve the object detection pipeline, focusing on a chosen
set of object classes (e.g., ~20 PASCAL categories). In
fact, many modern proposal methods are learning-based
[9-11, 42-46], where the definition of an “object” is the
set of annotated classes in the dataset. This increasingly
blurs the boundary between a proposal algorithm and a
detector.

Notice that the former definition emphasizes object
discovery [27, 28, 30], while the latter emphasizes the
ultimate performance of a detection pipeline. Surprisingly,
despite these two different goals of ‘object proposal’,
there exists only a single evaluation protocol:

1. Generate proposals on a dataset: The most commonly used
dataset for evaluation today is the PASCAL VOC [47]
detection set. Note that this is a partially annotated
dataset where only the 20 PASCAL category instances are
annotated.

2. Measure the performance of the generated proposals:
typically in terms of ‘recall’ of the annotated instances.
Commonly used metrics are described in Section 3.

^1 Most category independent object proposal methods define
an object as a “stand-alone thing with a well-defined
closed-boundary”. For the “thing” vs. “stuff” discussion,
see [41].

[Figure 1 panels: (a) (Green) Annotated, (Red) Unannotated; (b) Method 1 with recall 0.6; (c) Method 2 with recall 1]

Figure 1: (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the
image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories such as plates,
glasses, etc. that Method 2 missed. Despite that, the computed recall for Method 2 is higher because it recalled all instances of PASCAL
categories that were present in the ground truth. Note that the number of proposals generated by both methods is equal in this figure.

[Figure 2 panels: (a) (Green) Annotated, (Red) Unannotated; (b) Method 1 with recall 0.5; (c) Method 2 with recall 0.83]

Figure 2: (a) shows PASCAL annotations natively present in the dataset in green. Other objects that are not annotated but present in the
image are shown in red; (b) shows Method 1 and (c) shows Method 2. Method 1 visually seems to recall more categories such as lamps,
picture, etc. that Method 2 missed. Clearly the recall for Method 1 should be higher. However, the calculated recall for Method 2 is
significantly higher, which is counter-intuitive. This is because Method 2 recalls more PASCAL category objects.

The central thesis of this paper is that the current
evaluation protocol for object proposal methods is suitable
for object detection pipelines but is a ‘gameable’ and
misleading protocol for category independent tasks. By
evaluating only on a specific set of object categories, we
fail to capture the performance of the proposal algorithms
on all the remaining object categories that are present in
the test set, but not annotated in the ground truth.

Figs. 1, 2 illustrate this idea on images from PASCAL VOC
2010. Column (a) shows the ground-truth object annotations
(in green, the annotations natively present in the dataset
for the 20 PASCAL categories, e.g., ‘chairs’, ‘tables’,
‘bottles’; in red, the annotations that we added to the
dataset by marking objects such as ‘ceiling fan’, ‘table
lamp’, ‘window’, etc., originally annotated ‘background’ in
the dataset). Columns (b) and (c) show the outputs of two
object proposal methods. The top row (Fig. 1) shows the case
when both methods produce the same number of proposals; the
bottom row (Fig. 2) shows an unequal number of proposals. We
can see that the proposal method in Column (b) seems to be
more “complete”, in the sense that it recalls or discovers a
large number of instances. For instance, in the top row it
detects a number of non-PASCAL categories (‘plate’, ‘bowl’,
‘picture frame’, etc.) but misses the PASCAL category
‘table’. In both rows, the method in Column (c) is reported
as achieving a higher recall, even in the bottom row, where
it recalls strictly fewer objects, not just different ones.
The reason is that Column (c) recalls/discovers instances of
the 20 PASCAL categories, which are the only ones annotated
in the dataset. Thus, Method 2 appears to be a better object
proposal generator simply because it focuses on the
annotated categories in the dataset.

While intuitive (and somewhat obvious) in hindsight, we
believe this is a crucial finding because it makes the
current protocol ‘gameable’, i.e., susceptible to
manipulation (both intentional and unintentional), and
misleading for measuring improvement in category independent
object proposals.

Some might argue that if the end task is to detect a certain
set of categories (20 PASCAL or 80 COCO categories) then it
is enough to evaluate on them and there is no need to care
about other categories which are not annotated in the
dataset. We agree, but it is important to keep in mind that
object detection is not the only application of object
proposals. There are other tasks for which it is important
for proposal methods to generate category independent
proposals. For example, in semi/unsupervised object
localization [27-30] the goal is to identify all the objects
in a given image that contains many object classes, without
any specific target classes. In this problem, there are no
image-level annotations, no assumption of a single dominant
class, and not even a known number of object classes [28].
Thus, in such a setting, using a proposal method that has
tuned itself to the 20 PASCAL objects would not be ideal - in
the worst case, we may not discover any new objects. As
mentioned earlier, there are many such scenarios, including
learning object-object interactions [31, 32], content aware
media retargeting [33], visual tracking [36], etc.

To summarize, the contributions of this paper are: 

• We report the ‘gameability’ of the current object proposal
evaluation protocol.

• We demonstrate this ‘gameability’ via a simple thought
experiment where we propose a ‘fraudulent’ object proposal
method that significantly outperforms all existing object
proposal techniques on current metrics, but would under no
circumstances be considered a category independent proposal
technique. As a side contribution of our work, we present a
simple technique for producing state-of-the-art object
proposals.

• After establishing the problem, we propose three ways of
improving the current evaluation protocol to measure the
category independence of object proposals:

1. evaluation on fully annotated datasets,

2. cross-dataset evaluation on densely annotated datasets,

3. a new evaluation metric that quantifies the bias
capacity of proposal generators.

For the first test, we introduce a nearly-fully annotated 
PASCAL VOC 2010 where we annotated all instances 
of all object categories occurring in the images. 

• We thoroughly evaluate existing proposal methods on this
nearly-fully annotated dataset and on two densely annotated
datasets.

• We will release all code and data for experiments, and an
object proposals library that allows for easy comparison of
all popular object proposal techniques.

2. Related Work 

Types of Object Proposals: Object proposals can be broadly
divided into two categories:

• Window scoring: In these methods, the space of all
possible windows in an image is sampled to get a subset of
the windows (e.g., via sliding window). These windows are
then scored for the presence of an object based on the image
features from the windows. The algorithms that fall under
this category are [1, 4, 5, 10, 45, 48].

• Segment based: These algorithms involve over-segmenting
an image and merging the segments using some strategy. These
methods include [2, 3, 6-9, 11, 44, 46, 49]. The generated
region proposals can be converted to bounding boxes if
needed.

Beyond RGB proposals: Beyond the ones listed above, a wide
variety of algorithms fall under the umbrella of ‘object
proposals’. For instance, [50-54] used spatio-temporal
object proposals for action recognition, segmentation and
tracking in videos. Another direction of work [55-57]
explores the use of RGB-D cuboid proposals for object
detection and semantic segmentation in RGB-D images. While
the scope of this paper is limited to proposals in RGB
images, the central thesis of the paper (i.e., gameability
of the evaluation protocol) is broadly applicable to other
settings.

Evaluating Proposals: There has been relatively limited
analysis and evaluation of proposal methods or of the
proposal evaluation protocol. Hosang et al. [58] focus on
evaluation of object proposal algorithms, in particular the
stability of such algorithms under parameter changes and
image perturbations. Their work shows that a large number of
category independent proposal algorithms indeed generalize
well to non-PASCAL categories, for instance in the ImageNet
200 category detection dataset [40]. Although these findings
are important (and consistent with our experiments), they
are unrelated to the ‘gameability’ of the evaluation
protocol, which is our focus. In [59], the authors present
an analysis of various proposal methods regarding proposal
repeatability, ground truth annotation recall, and their
impact on detection performance. They also introduce a new
evaluation metric (Average Recall). Their argument for a new
metric is the need for better localization between generated
proposals and ground truth. While this is a valid and
significant concern, it is orthogonal to the ‘gameability’
of the evaluation protocol, which to the best of our
knowledge has not been previously addressed. Another recent
related work is [60], which analyzes the state-of-the-art
methods in segment-based object proposals, focusing on the
challenges faced when going from PASCAL VOC to MS COCO. They
also analyze how aligned the proposal methods are with the
bias observed in MS COCO towards small objects and the
center of the image, and propose a method to boost their
performance. Although that work discusses biases in
datasets, it does not address our theme, which is
‘gameability’ due to these biases. As stated earlier, while
early papers [2, 4, 6] reported cross-dataset or
cross-category generalization experiments similar to the
ones reported in this paper, with the trend of
learning-based proposal methods these experiments and
concerns seem to have fallen out of standard practice, which
we show is problematic.

3. Evaluating Object Proposals 

Before we describe our evaluation and analysis, let us first
look at the object proposal evaluation protocol that is
widely used today. The following two factors are involved:

1. Evaluation Metric: The metrics used for evaluating object
proposals are all typically functions of intersection over
union (IoU) (or Jaccard Index) between generated proposals
and ground-truth annotations. For two boxes/regions b_i and
b_j, IoU is defined as:

$$\mathrm{IoU}(b_i, b_j) = \frac{\mathrm{area}(b_i \cap b_j)}{\mathrm{area}(b_i \cup b_j)} \tag{1}$$


The following metrics are commonly used: 

• Recall @ IoU Threshold t: For each ground-truth instance,
this metric checks whether the ‘best’ proposal from list L
has IoU greater than a threshold t. If so, this ground truth
instance is considered ‘detected’ or ‘recalled’. Then
average recall is measured over all the ground truth
instances:

$$\mathrm{Recall}@t = \frac{1}{|G|} \sum_{g_i \in G} I\Big[\max_{l_j \in L} \mathrm{IoU}(g_i, l_j) > t\Big], \tag{2}$$

where I[·] is an indicator function for the logical
proposition in the argument. Object proposals are evaluated
using this metric in two ways:

- plotting Recall-vs.-#proposals by fixing t

- plotting Recall-vs.-t by fixing the #proposals in L.

• Area Under the recall Curve (AUC): AUC summarizes the area
under the Recall-vs.-#proposals plot for different values of
t in a single plot. This metric measures AUC-vs.-#proposals.
It is also plotted by varying #proposals in L and plotting
AUC-vs.-t.

• Volume Under Surface (VUS): This measures the 
average recall by linearly varying t and varying the 
#proposals in L on either linear or log scale. Thus it 
merges both kinds of AUC plots into one. 

• Average Best Overlap (ABO): This metric eliminates the
need for a threshold. We first calculate the overlap between
each ground truth annotation g_i ∈ G and the ‘best’ object
hypothesis in L. ABO is calculated as the average:

$$\mathrm{ABO} = \frac{1}{|G|} \sum_{g_i \in G} \max_{l_j \in L} \mathrm{IoU}(g_i, l_j) \tag{3}$$

ABO is typically calculated on a per class basis. Mean
Average Best Overlap (MABO) is defined as the mean ABO over
all classes.

• Average Recall (AR): This metric was recently introduced
in [59]. Here, average recall (for IoU between 0.5 and 1)
vs. #proposals in L is plotted. AR also summarizes proposal
performance across different values of t. AR was shown to
correlate with ultimate detection performance better than
other metrics.

2. Dataset: The most commonly used datasets are the PASCAL
VOC [47] detection datasets. Note that these are partially
annotated datasets where only the 20 PASCAL category
instances are annotated. Recently, analyses have also been
reported on ImageNet [61], which has more categories
annotated than PASCAL, but is still a partially annotated
dataset.
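
The threshold-based and threshold-free metrics above can be sketched in a few lines. The following is a minimal illustration, not the evaluation code used in our experiments: it assumes axis-aligned boxes given as (x_min, y_min, x_max, y_max) tuples, and the function names (`iou`, `best_overlaps`, `recall_at`, `abo`) are hypothetical. It operates on one list of ground-truth boxes and one list of proposals; a dataset-level number would pool the ground-truth instances of all images.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes, as in Eq. (1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_overlaps(gt: Sequence[Box], proposals: Sequence[Box]) -> List[float]:
    """For each ground-truth box, the IoU of its best-matching proposal."""
    return [max((iou(g, p) for p in proposals), default=0.0) for g in gt]


def recall_at(gt: Sequence[Box], proposals: Sequence[Box], t: float) -> float:
    """Fraction of ground-truth boxes whose best IoU exceeds t, as in Eq. (2)."""
    best = best_overlaps(gt, proposals)
    return sum(o > t for o in best) / len(best)


def abo(gt: Sequence[Box], proposals: Sequence[Box]) -> float:
    """Average Best Overlap, as in Eq. (3)."""
    best = best_overlaps(gt, proposals)
    return sum(best) / len(best)
```

Note that only annotated boxes ever enter `gt`, so a proposal that tightly covers an unannotated object contributes nothing to any of these quantities.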

4. A Thought Experiment: How to Game the Evaluation Protocol

Let us conduct a thought experiment to demonstrate that the 
object proposal evaluation protocol can be ‘gamed’. 

Imagine yourself reviewing a paper claiming to introduce a
new object proposal method called DMP.

Before we divulge the details of DMP, consider the
performance of DMP shown in Fig. 3 on the PASCAL VOC 2010
dataset, under the AUC-vs.-#proposals metric.



Figure 3: Performance of different object proposal methods 
(dashed lines) and our proposed ‘fraudulent’ method (DMP) on the 
PASCAL VOC 2010 dataset. We can see that DMP significantly 
outperforms all other proposal generators. See text for details. 

As we can clearly see, the proposed method DMP significantly
exceeds all existing proposal methods [1-6, 8, 10, 11]
(which seem to have little variation over one another). The
improvement at some points in the curve (e.g., at M=10)
seems to be an order of magnitude larger than all previous
incremental improvements reported in the literature! In
addition to the gain in AUC at a fixed M, DMP also achieves
the same AUC (0.55) with an order of magnitude fewer
proposals (M=10 vs. M=50 for EdgeBoxes [1]). Thus, fewer
proposals need to be processed by the ensuing detection
system, resulting in an equivalent run-time speedup. This
seems to indicate that significant progress has been made in
the field of generating object proposals.

So what is our proposed state-of-the-art technique DMP?

It is a mixture-of-experts model, consisting of 20 experts,
where each expert is a deep feature (fc7)-based [62]
objectness detector. At this point, you, the savvy reader,
are probably already beginning to guess what we did.

DMP stands for ‘Detector Masquerading as Proposal
generator’. We trained object detectors for the 20 PASCAL
categories (in this case with RCNN [12]), then used these 20
detectors to produce the top-M most confident detections
(after NMS), and declared them to be ‘object proposals’.

The point of this experiment is to demonstrate the following
fact - clearly, no one would consider a collection of 20
object detectors to be a category independent object
proposal method. However, our existing evaluation protocol
declared the union of these top-M detections to be
state-of-the-art.

Why did this happen? Because the protocol today involves
evaluating a proposal generator on a partially annotated
dataset such as PASCAL. The protocol does not reward recall
of non-PASCAL categories; in fact, early recall (near the
top of the list of candidates) of non-PASCAL objects results
in a penalty for the proposal generator! As a result, a
proposal generator that tunes itself to these 20 PASCAL
categories (either explicitly via training or implicitly via
design choices or hyper-parameters) will be declared a
better proposal generator when it may not be (as illustrated
by DMP). Notice that as learning-based object proposal
methods improve on this metric, “in the limit” the best
object proposal technique is a detector for the annotated
categories, similar to our DMP. Thus, we should be cautious
of methods proposing incremental improvements on this
protocol - improvements on this protocol do not necessarily
lead to a better category independent object proposal
method.

This thought experiment exposes the inability of the
existing protocol to evaluate category independence.
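
The effect can be reproduced with a toy count. The sketch below uses hypothetical object names and hypothetical recall outcomes (they are illustrative only, not measurements from any dataset): one image contains five annotated PASCAL instances and four unannotated objects, Method 1 finds most of everything, and Method 2 finds only the annotated instances.

```python
def recall(recalled: set, ground_truth: set) -> float:
    """Fraction of ground-truth instances that a method recalled."""
    return len(recalled & ground_truth) / len(ground_truth)


# Hypothetical image content (illustrative names only).
annotated = {"chair_1", "chair_2", "table", "bottle_1", "bottle_2"}
unannotated = {"plate", "bowl", "glass", "picture_frame"}
all_objects = annotated | unannotated

# Method 1: category independent; misses two annotated instances.
method_1 = {"chair_1", "chair_2", "bottle_1"} | unannotated
# Method 2: tuned to the annotated categories; finds nothing else.
method_2 = set(annotated)

# Current protocol: only annotated instances count.
partial_1 = recall(method_1, annotated)    # 3 of 5
partial_2 = recall(method_2, annotated)    # 5 of 5

# Fully annotated ground truth: every object counts.
full_1 = recall(method_1, all_objects)     # 7 of 9
full_2 = recall(method_2, all_objects)     # 5 of 9
```

Under the partial annotations Method 2 wins (1.0 vs. 0.6), while under the full annotations the ranking reverses (5/9 vs. 7/9), which is exactly the reversal the thought experiment exploits.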

5. Evaluation on Fully and Densely Annotated Datasets

As described in the previous section, the problem of
‘gameability’ occurs due to the evaluation of proposal
methods on partially annotated datasets. An intuitive
solution would be evaluating on a fully annotated dataset.

In the next two subsections, we evaluate the performance of
7 popular object proposal methods [1, 3-6, 8, 10] and two
DMPs (RCNN [12] and DPM [64]) on one nearly-fully and two
densely annotated datasets containing many more object
categories. This is to quantify how much the performance of
our ‘fraudulent’ proposal generators (DMPs) drops once the
bias towards the 20 PASCAL categories is diminished (or
completely removed).

We begin by creating a nearly-fully annotated dataset,
building on the effort of PASCAL Context [63], and evaluate
on this nearly-fully annotated, instance-level version of
PASCAL Context; this is followed by cross-dataset evaluation
on other partially-but-densely annotated datasets, MS COCO
[65] and NYU-Depth V2 [66].

Experimental Setup: On the MS COCO and PASCAL Context
datasets we conducted experiments as follows:

• Use the existing evaluation protocol for evaluation, i.e.,
evaluate only on the 20 PASCAL categories.

• Evaluate on all the annotated classes.

• For the sake of completeness, we also report results on
all the classes except the PASCAL 20 classes.^2

Training of DMPs: The two DMPs we use are based on two
popular object detectors - DPM [64] and RCNN [12]. We train
DPM on the 20 PASCAL categories and use it as an object
proposal method. To generate a large number of proposals, we
chose a low value of the threshold in Non-Maximum
Suppression (NMS). Proposals are generated for each category
and a score is assigned to them by the corresponding DPM for
that category. These proposals are then merge-sorted on the
basis of this score. The top M proposals are selected from
this sorted list, where M is the number of proposals to be
generated.
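
The merge-and-truncate step described above can be sketched as follows. This is an illustrative sketch only: it assumes each category's detections are already available as a list of (score, box) pairs sorted by descending score, and the function name `top_m_merged` is hypothetical. The detector itself and NMS are outside its scope.

```python
import heapq
from itertools import islice
from typing import Dict, Hashable, List, Tuple

Detection = Tuple[float, Hashable]  # (detector score, box or box id)


def top_m_merged(per_category: Dict[str, List[Detection]], m: int) -> List[Hashable]:
    """Merge per-category detection lists by score and keep the top m boxes.

    Each input list must already be sorted by descending score, which is
    what heapq.merge requires when reverse=True.
    """
    merged = heapq.merge(
        *per_category.values(),
        key=lambda det: det[0],  # compare on the detector score only
        reverse=True,            # highest score first
    )
    return [box for _, box in islice(merged, m)]
```

Because the raw scores of different category detectors are compared directly, the ranking depends on how comparable those scores are across categories.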

Another (stronger) DMP is RCNN, a detection pipeline that
uses 20 SVMs (one per PASCAL category) trained on deep
features (fc7) [62] extracted on selective search boxes.
Since RCNN itself uses selective search proposals, it should
be viewed as a trained reranker of selective search boxes.
As a consequence, it ultimately equals selective search
performance once the number of candidates becomes large. We
used the pretrained SVM models released with the RCNN code,
which were trained on the 20 classes of the PASCAL VOC 2007
trainval set. For every test image, we generate the
Selective Search proposals using the ‘FAST’ mode and
calculate the 20 SVM scores for each proposal. The
‘objectness’ score of a proposal is then the maximum of the
20 SVM scores. All the proposals are then sorted by this
score and the top M proposals are selected.^3
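
The reranking step can be sketched as below. This is a minimal illustration under stated assumptions, not the RCNN code: the candidate boxes and their per-class SVM scores are taken as given inputs, and the function name `dmp_rerank` is hypothetical.

```python
from typing import List, Sequence, TypeVar

B = TypeVar("B")


def dmp_rerank(boxes: Sequence[B], svm_scores: Sequence[Sequence[float]], m: int) -> List[B]:
    """Rerank candidate boxes by the maximum of their per-class scores.

    boxes:      candidate proposals (e.g., selective search boxes).
    svm_scores: svm_scores[i] holds the per-class scores of boxes[i]
                (20 entries in the setting described above).
    m:          number of proposals to keep.
    """
    # 'Objectness' of a box = its highest class score.
    objectness = [max(scores) for scores in svm_scores]
    # Indices sorted by descending objectness; sorted() is stable on ties.
    order = sorted(range(len(boxes)), key=lambda i: objectness[i], reverse=True)
    return [boxes[i] for i in order[:m]]
```

When m reaches the number of candidates, the output is just a permutation of the input boxes, which is why such a reranker converges to the recall of the underlying proposal method.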

Object Proposals Library: To ease the process of carrying
out the experiments, we created an open source, easy-to-use
object proposals library. This can be used to seamlessly
generate object proposals using all the existing algorithms
[1-9] (for which the Matlab code has been released by the
respective authors) and evaluate these proposals on any
dataset using the commonly used metrics. This library will
be made publicly available.

5.1. Fully Annotated Dataset 

PASCAL Context: This dataset was introduced by Mottaghi et
al. [63]. It contains additional annotations for all images
of the PASCAL VOC 2010 dataset [67]. The annotations are
semantic segmentation maps, where every single pixel
previously annotated ‘background’ in PASCAL was assigned a
category label. In total, annotations have been provided for
459 categories. This includes the original 20 PASCAL
categories and new classes such as keyboard, fridge,
picture, cabinet, plate, clock.

Unfortunately, the dataset contains only category-level
semantic segmentations. For our task, we needed
instance-level bounding box annotations, which cannot be
reliably extracted from category-level segmentation masks.

^2 On NYU-Depth V2 performance is only evaluated on all categories.
This is because only 8 PASCAL categories are present in this dataset.

^3 It was observed that merge-sorting calibrated/rescaled SVM scores led
to inferior performance as compared to merge-sorting without rescaling.

[Figure 4 panels: (a) Average #annotations for different categories; (b) Fraction of image-area covered by different categories; (c) PASCAL Context annotations [63]; (d) Our augmented annotations]

Figure 4: (a),(b) Distribution of object classes in PASCAL Context with respect to different attributes. (c),(d) Augmenting PASCAL
Context with instance-level annotations. (Green = PASCAL 20 categories; Red = new objects)

Figure 5: Performance of different methods on PASCAL Context, MS COCO and NYU-Depth V2 with different sets of annotations.

Creating Instance-Level Annotations for PASCAL Context:
Thus, we created instance-level bounding box annotations for
all images in the PASCAL Context dataset. First, out of the
459 category labels in PASCAL Context, we identified 396
categories to be ‘things’, and ignored the remaining ‘stuff’
or ‘ambiguous’ categories^4 - neither of these lend
themselves to bounding-box-based object detection. See
supplement for details.

^4 A ‘tree’ may be a ‘thing’ or ‘stuff’ subject to camera viewpoint.

We selected the 60 most frequent non-PASCAL categories from
this list of ‘things’ and manually annotated all their
instances. Selecting only the top 60 categories is a
reasonable choice because the average per-category frequency
in the dataset for all the other categories (even after
including background/ambiguous categories) was roughly one
third of that of the chosen 60 categories (Fig. 4a).
Moreover, the percentage of pixels in an image left
unannotated (as ‘background’) drops from 58% in the original
PASCAL to 50% in our nearly-fully annotated PASCAL Context.
This manual annotation was performed with the aid of the
semantic segmentation maps present in the PASCAL Context
annotations. Example annotations are shown in Fig. 4d. For
detailed statistics, see supplement.

Results and Observations: We now explore how changes in the
dataset and annotated categories affect the results of the
thought experiment from Section 4. Figs. 5a, 5b, 5c, 5h
compare the performance of DMPs with a number of existing
proposal methods [1-6, 8, 10, 11] on PASCAL Context.

We can see in Column (a) that when evaluated on only the 20
PASCAL categories, DMPs trained on these categories appear
to significantly outperform all proposal generators.
However, we can see that they are not category independent
because they suffer a big drop in performance when evaluated
on the 60 non-PASCAL categories in Column (b). Notice that
on PASCAL Context, all proposal generators suffer a drop in
performance between the 20 PASCAL categories and the 60
non-PASCAL categories. We hypothesize that this is due to
the fact that the non-PASCAL categories tend to be generally
smaller than the PASCAL categories (which were the main
targets of the dataset curators) and hence difficult to
detect. It could also be that the authors of these methods
made certain design choices which catered better to the 20
annotated categories. However, the key observation here (as
shown in Fig. 5h) is that DMPs suffer the biggest drop, much
greater than that of all the other approaches. It is
interesting to note that due to the ratio of instances of
the 20 PASCAL categories vs. the other 60 categories, DMPs
continue to slightly outperform proposal generators when
evaluated on all categories, as shown in Column (c).

5.2. Densely Annotated Datasets 

Besides being expensive, “full” annotation of images is
somewhat ill-defined due to the hierarchical nature of
object semantics (e.g., are object-parts such as
bicycle-wheel, windows in a building, eyes in a face, etc.
also objects?). One way to side-step this issue is to use
datasets with dense annotations (albeit at the same
granularity) and conduct cross-dataset evaluation.

MS COCO: The Microsoft Common Objects in Context (MS COCO)
dataset [65] contains 91 common object categories, with 82
of them having more than 5,000 labeled instances. It not
only has a significantly higher number of instances per
category than PASCAL, but also considerably more object
instances per image (7.7) as compared to ImageNet (3.0) and
PASCAL (2.3).

NYU-Depth V2: The NYU-Depth V2 dataset [66] comprises video
sequences from a variety of indoor scenes as recorded by
both RGB and depth cameras. It features 1449 densely labeled
pairs of aligned RGB and depth images with instance-level
annotations. We used these 1449 densely annotated RGB images
for evaluating object proposal algorithms. To the best of
our knowledge, this is the first paper to compare proposal
methods on such a dataset.

Results and Observations: Figs. 5d, 5e, 5f, 5i show plots
for MS COCO analogous to those for PASCAL Context. Again,
DMPs outperform all other methods on PASCAL categories but
fail to do so for the non-PASCAL categories. Fig. 5g shows
results for NYU-Depth V2. Notice that when many classes in
the test dataset are not PASCAL classes, DMPs tend to
perform poorly, although it is interesting that the
performance is still not as poor as that of the worst
proposal generators. Results on other evaluation criteria
are in the supplement.

6. Bias Inspection 

So far, we have discussed two ways of detecting
‘gameability’ - evaluation on a nearly-fully annotated
dataset and cross-dataset evaluation on densely annotated
datasets. Although these methods are fairly useful for bias
detection, they have certain limitations. Datasets can be
unbalanced: some categories can be more frequent than
others, while others can be hard to detect (due to choices
made in dataset collection). These issues need to be
resolved for a perfectly unbiased evaluation. However,
generating unbiased datasets is an expensive and
time-consuming process. Hence, to detect the bias without
collecting unbiased datasets, we need a method which can
measure the performance of proposal methods in a way that
accounts for category specific biases and measures the
extent, or capacity, of this bias. We introduce such a
method in this section.

6.1. Assessing Bias Capacity 

Many proposal methods [9-11, 42-46] rely on explicit
training to learn an “objectness” model, similar to DMPs.
Depending upon which, and how many, categories they are
trained on, these methods could have a biased view of
“objectness”. One way of measuring the bias capacity of a
proposal method is to plot the performance vs. the number of
‘seen’ categories while evaluating on some held-out set. A
method that involves little or no training will be a flat
curve on this plot. Biased methods such as DMPs will get
better and better as more categories are seen in training.
Thus, this analysis can help us find biased or
‘gameability-prone’ methods like DMPs that are/can be tuned
to specific classes.
To the best of our knowledge, no previous work has attempted
to measure bias capacity by varying the number of ‘object’
categories seen at training time. In this experiment, we
compared the performance of one DMP method (RCNN), one
learning-based proposal method (Objectness), and two non
learning-based proposal methods (Selective Search [8],
EdgeBoxes [1]) as a function of the number of ‘seen’
categories (the categories trained on; see the footnote below)
on the MS COCO [65] dataset. Method names ‘RCNNTrainN’ and
‘objectnessTrainN’ indicate that they were trained on images
that contain annotations for only N categories (50 instances
per category). The total number of images for all 60
categories was ~2400 (because some images contain >1 object).
Once trained, these methods were evaluated on a randomly-chosen
set of ~500 images, which had annotations for all 60
categories.

Figure 6: Performance of RCNN and other proposal generators vs.
number of object categories used for training. We can see that
RCNN has the most ‘bias capacity’ while the performance of
other methods is nearly (or absolutely) constant.
(a) Area under recall vs. # proposals for various #seen
categories.
(b) Area under recall vs. #training-categories.
(c) Improvement in area under recall from #seen categories = 10
to 60 vs. # proposals.
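
As a rough illustration of the diagnostic described above, the sketch below computes an area-under-recall score for a method trained on different numbers of seen categories and summarizes the trend with a fitted slope. It assumes that "area under recall" means the area under the recall-vs.-IOU curve for thresholds from 0.5 to 1.0; the exact definition used in the paper may differ. The function names and all sample numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def area_under_recall(best_ious, thresholds=None):
    """Normalized area under the recall-vs-IOU curve.

    best_ious: for each ground-truth box, the best IOU achieved by any
    proposal within a fixed proposal budget (assumed input format).
    """
    if thresholds is None:
        thresholds = np.linspace(0.5, 1.0, 11)  # assumed IOU range
    best = np.asarray(best_ious, dtype=float)
    recall = np.array([(best >= t).mean() for t in thresholds])
    widths = thresholds[1:] - thresholds[:-1]
    area = np.sum(widths * (recall[1:] + recall[:-1]) / 2.0)  # trapezoids
    return float(area / (thresholds[-1] - thresholds[0]))


def bias_capacity_slope(num_seen, aucs):
    """Slope of a least-squares line through AUC vs. #seen categories."""
    slope, _intercept = np.polyfit(num_seen, aucs, deg=1)
    return float(slope)


# Made-up best-IOU values on a held-out set (not results from the paper).
best_ious_by_seen = {
    10: [0.32, 0.41, 0.52, 0.58],
    30: [0.45, 0.56, 0.63, 0.71],
    60: [0.58, 0.66, 0.74, 0.83],
}
seen = sorted(best_ious_by_seen)
trained_aucs = [area_under_recall(best_ious_by_seen[n]) for n in seen]

trained_slope = bias_capacity_slope(seen, trained_aucs)    # grows with N
flat_slope = bias_capacity_slope(seen, [0.45, 0.45, 0.45])  # no training
```

Under these assumptions, a training-free method yields a slope of essentially zero, while a method that keeps improving as more categories are seen yields a positive slope.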

Fig. 6a shows area under recall vs. #proposals for
learning-based methods trained on different sets of
categories. Fig. 6b and Fig. 6c show the variation of AUC vs.
# seen categories and the improvement due to the increase in
training categories (from 10 to 60) vs. #proposals,
respectively, for RCNN and objectness when trained on
different sets of categories. The key observation is that
even with a modest increase in ‘seen’ categories, and the
same increase in training data, the performance improvement
of RCNN is significantly larger than that of objectness.
Selective Search [8] and EdgeBoxes [1] appear as dashed
straight lines since no training is involved.
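
The comparison in Fig. 6c can be sketched as a per-budget difference between two training configurations. The snippet below is only an illustration: the helper name, the data layout, and every number in it are made up and are not measurements from the paper.

```python
def auc_improvement(auc_low, auc_high):
    """Gain in area under recall at each proposal budget.

    auc_low / auc_high: dicts mapping #proposals -> area under recall for
    the same method trained on fewer / more categories.
    """
    return {budget: auc_high[budget] - auc_low[budget]
            for budget in sorted(auc_low)}


# Hypothetical values, keyed by method, then #seen categories, then budget.
measurements = {
    "RCNN": {10: {100: 0.20, 1000: 0.30}, 60: {100: 0.35, 1000: 0.45}},
    "objectness": {10: {100: 0.15, 1000: 0.25}, 60: {100: 0.17, 1000: 0.27}},
}

gains = {
    method: auc_improvement(by_seen[10], by_seen[60])
    for method, by_seen in measurements.items()
}
# Method whose total gain from 10 to 60 seen categories is largest.
larger_gain = max(gains, key=lambda method: sum(gains[method].values()))
```

For a training-free method the two dictionaries would be identical, so every gain would be zero.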

These results clearly indicate that as RCNN sees more
categories, its performance improves. One might argue that
the method is simply learning more ‘objectness’ as it sees
more data. However, as discussed above, the increase in the
dataset size is marginal (~40 images per category), and hence
it is unlikely that such a significant improvement is due to
that alone. Thus, it is reasonable to conclude that this
improvement arises because the method is learning
class-specific features.

Thus, this approach can be used to reason about
‘gameability-prone’ and ‘gameability-immune’ proposal methods
without creating an expensive fully annotated dataset. We
believe this simple but effective diagnostic experiment would
help to detect, and thus contribute to managing, the
category-specific bias in all learning-based methods.

Footnote: The seen categories are picked in the order they are
listed in the MS COCO dataset (i.e., no specific criterion was
used).

7. Conclusion 

In this paper, we make an explicit distinction between the
two mutually co-existing but different interpretations of
object proposals. The current evaluation protocol for object
proposal methods is suitable only for detection proposals and
is a biased, ‘gameable’ protocol for category-independent
object proposals. By evaluating only on a specific set of
object categories, we fail to capture the performance of the
proposal algorithm on all the remaining object categories that
are present in the test set but not annotated in the ground
truth. We demonstrate this gameability via a simple thought
experiment in which we propose a ‘fraudulent’ object proposal
method that outperforms all existing object proposal
techniques on current metrics. We conduct a thorough
evaluation of existing object proposal methods on three
densely annotated datasets. We introduce a fully-annotated
version of PASCAL VOC 2010 where we annotated all instances of
all object categories occurring in all images. We hope this
dataset will be broadly useful.

Furthermore, since densely annotating a dataset is a tedious
and costly task, we proposed a set of diagnostic tools to plug
the vulnerability of the current protocol.

Fortunately, we find that none of the existing proposal
methods seems to be biased: most of the existing algorithms do
generalize well to different datasets and, in our experiments,
even to densely annotated datasets. In that sense, our findings
are consistent with the results in [59]. However, that should
not prevent us from recognizing and safeguarding against the
flaws in the protocol, lest we over-fit as a community to a
specific set of object classes.











8. Appendix 

The main paper demonstrated how the object proposal evaluation
protocol is ‘gameable’ and performed some experiments to detect
this ‘gameability’. In this supplement, we present additional
details and results which support the arguments presented in
the main paper.

In section 8.1, we list and briefly describe the different
object proposal algorithms which we used for our experiments.
Following this, details of instance-level PASCAL Context are
discussed in section 8.2. Then we present the results of the
nearly-fully annotated dataset and cross-dataset evaluations
on other evaluation metrics in section 8.3. We also show the
per-category performance of various methods on MS COCO and
PASCAL Context in section 8.4.

8.1. Overview of Object Proposal Algorithms 

Table 1 provides an overview of some popular object proposal
algorithms. The symbol * indicates methods we have evaluated
in this paper. Note that a majority of the approaches are
learning based.

8.2. Details of PASCAL Context Annotation 

As explained in section 5.1 of the main paper, PASCAL Context
provides full annotations for the PASCAL VOC 2010 dataset in
the form of semantic segmentations. A total of 459 classes have
been labeled in this dataset. We split these into three
categories, namely Objects/Things, Background/Stuff and
Ambiguous, as shown in Tables 2, 4 and 3, respectively. Most
classes (396) were put in the ‘Objects’ category. 20 of these
are PASCAL categories. Of the remaining 376, we selected the 60
most frequently occurring categories and manually created
instance-level annotations for them.
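
The class counts above can be checked with a few lines of arithmetic. In this sketch, the sizes of the Background/Stuff and Ambiguous groups are not stated in the text; they are assumed here from counting the entries listed in Tables 4 and 3, and the variable names are illustrative.

```python
total_classes = 459        # classes labeled in PASCAL Context
background_classes = 44    # assumption: entries counted in Table 4
ambiguous_classes = 19     # assumption: entries counted in Table 3

# Everything that is neither background nor ambiguous is an object class.
object_classes = total_classes - background_classes - ambiguous_classes

pascal_classes = 20
non_pascal_object_classes = object_classes - pascal_classes

newly_annotated = 60       # most frequent non-PASCAL object classes
evaluated_classes = pascal_classes + newly_annotated
not_instance_annotated = non_pascal_object_classes - newly_annotated
```

With these counts the three groups add up to 459, and 316 rarer object classes remain without instance-level annotations, which is consistent with calling the dataset nearly-fully annotated.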

Statistics of New Annotations: We made the following
observations on our new annotations:

• The number of instances we annotated for the extra 60
categories was about the same as the number of instances
annotated for the 20 PASCAL categories in the original PASCAL
VOC. This shows that about half the annotations were missing
and thus a lot of genuine proposal candidates are not being
rewarded.

• Most non-PASCAL categories occupy a small percentage of the
image. This is understandable given that the dataset was
curated with the PASCAL categories in mind; the other
categories just happened to be in the pictures.


Ambiguous Classes in PASCAL Context Dataset

artillery      escalator          ice        speedbump
bedclothes     exhibitionbooth    leaves     stair
clothestree    flame              outlet     tree
coral          guardrail          rail       unknown
dais           handrail           shelves

Table 3: Ambiguous Classes in PASCAL Context


Background/Stuff Classes in PASCAL Context Dataset

atrium           floor           parterre      sky
bambooweaving    foam            patio         smoke
bridge           footbridge      pelage        snow
building         goal            plastic       stage
ceiling          grandstand      platform      swimmingpool
concrete         grass           playground    track
controlbooth     ground          road          wall
counter          hay             runway        water
court            kitchenrange    sand          wharf
dock             metal           shed          wood
fence            mountain        sidewalk      wool

Table 4: Background/Stuff Classes in PASCAL Context


8.3. Evaluation of Proposals on Other Metrics 

In this section, we show the performance of different proposal
methods and DMPs on the MS COCO dataset on various metrics.
Fig. 7a shows performance on the Recall-vs.-IOU metric at 1000
proposals on the PASCAL 20 categories. Figs. 7b and 7c show
performance on the Recall-vs.-#proposals metric at 0.5 and 0.7
IOU, respectively. Similarly, Figs. 7d, 7e, 7f and Figs. 7g,
7h, 7i show the performance of all proposal methods and DMPs
on these three metrics when the 60 non-PASCAL categories and
all categories, respectively, are annotated in the MS COCO
dataset.
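
To make the metrics concrete, here is a minimal sketch of recall at a fixed IOU threshold and proposal budget, with an optional restriction to a subset of annotated categories. The function names, the (x1, y1, x2, y2) box format, and the toy image are assumptions made for illustration; they are not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall(ground_truth, proposals, iou_threshold=0.5,
           max_proposals=1000, categories=None):
    """Fraction of ground-truth boxes covered by the top proposals.

    ground_truth: list of (category, box); proposals: ranked list of boxes.
    categories: if given, only these annotated categories are scored.
    """
    top = proposals[:max_proposals]
    boxes = [box for cat, box in ground_truth
             if categories is None or cat in categories]
    if not boxes:
        return 0.0
    hits = sum(any(iou(box, p) >= iou_threshold for p in top)
               for box in boxes)
    return hits / len(boxes)


# Made-up image: one PASCAL object (dog) and one non-PASCAL object (lamp).
ground_truth = [("dog", (0, 0, 10, 10)), ("lamp", (20, 20, 30, 30))]
proposals = [(0, 0, 10, 10), (5, 0, 15, 10)]

recall_pascal_only = recall(ground_truth, proposals, categories={"dog"})
recall_all = recall(ground_truth, proposals)
```

In this toy case the same proposals score perfect recall when only the PASCAL category is annotated, but only half when the non-PASCAL object is also counted, which is the effect the three annotation settings in Fig. 7 are meant to expose.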

These metrics demonstrate the same trend as shown by
AUC-vs.-#proposals in the main paper. When only PASCAL
categories are annotated (Figs. 7a, 7b, 7c), DMPs outperform
all proposal methods. However, when other categories are also
annotated (Figs. 7g, 7h, 7i) or the performance is evaluated
specifically on the other categories (Figs. 7d, 7e, 7f), DMPs
cease to be the top performers. Finally, we also report
results on different metrics for PASCAL Context (Fig. 8) and
NYU-Depth v2 (Fig. 9). They show similar trends, supporting
the claims made in the paper.

8.4. Measuring Fine-Grained Recall 

We also looked at a more fine-grained per-category performance
of proposal methods and DMPs. Fine-grained recall can be used
to answer whether some proposal methods are optimized for
larger or more frequent categories, i.e., whether they perform
better or worse with respect to different object attributes
like area, kinds of objects, etc. It is also easier to observe
the change in performance of a particular method on a
frequently occurring category vs. a rarely occurring category.
We performed this experiment on the instance-level PASCAL
Context and MS COCO datasets. We sorted/clustered all
categories on the basis of:

• Average size (fraction of image area) of the category, 

• Frequency (Number of instances) of the category, 

• Membership in ‘super-categories’ defined in MS 
COCO dataset (electronics, animals, appliance, etc.). 
10 pre-defined clusters of objects of different kind 
(These clusters are the subset of 11 super-categories 






Method | Code Source | Approach | Learning Involved | Metric | Datasets
-------|-------------|----------|-------------------|--------|---------
objectness* | Source code from [70] | Window scoring | Yes, supervised; train on 6 PASCAL classes and their own custom dataset of 50 images | Recall @ t > 0.5 vs # proposals | PASCAL VOC 07 test set, test on unseen 16 PASCAL classes
selectiveSearch* | Source code from [71] | Segment based | No | Recall @ t > 0.5 vs # proposals, MABO, per class ABO | PASCAL VOC 2007 test set, PASCAL VOC 2012 trainval set
rahtu* | Source code from [72] | Window scoring | Yes, two stages. Learning of generic bounding box prior on PASCAL VOC 2007 train set, weights for feature combination learnt on the dataset released with [70] | Recall @ t > various IOU thresholds and # proposals, AUC | PASCAL VOC 2007 test set
randomPrim* | Source code from [73] | Segment based | Yes, supervised; train on 6 PASCAL categories | Recall @ t > various IOU thresholds using 10k and 1k proposals | PASCAL VOC 2007 test set/2012 trainval set on 14 categories not used in training
mcg* | Source code from [74] | Segment based | Yes | NA, only segments were evaluated | NA (tested on segmentation dataset)
edgeBoxes* | Source code from [75] | Window scoring | No | AUC, Recall @ t > various IOU thresholds and # proposals, Recall vs IOU | PASCAL VOC 2007 test set
bing* | Source code from [76] | Window scoring | Yes, supervised; on PASCAL VOC 2007 train set, 20 object classes/6 object classes | Recall @ t > 0.5 vs # proposals | PASCAL VOC 2007 detection complete test set/14 unseen object categories
rantalankila | Source code from [77] | Segment based | Yes | NA, only segments are evaluated | NA (tested on segmentation dataset)
Geodesic | Source code from [78] | Segment based | Yes, for seed placement and mask construction on PASCAL VOC 2012 Segmentation training set | VUS at 10k and 2k windows, Recall vs IOU threshold, Recall vs proposals | PASCAL 2012 detection validation set
Rigor | Source code from [79] | Segment based | Yes, pairwise potentials between superpixels learned on BSDS-500 boundary detection dataset | NA, only segments were evaluated | NA (tested on segmentation dataset)
endres | Source code from [80] | Segment based | Yes | NA, only segments are evaluated | NA (tested on segmentation dataset)

Table 1: Properties of existing bounding box approaches. * indicates the methods which have been studied in this paper.
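
Table 1 lists ABO and MABO among the metrics reported by prior work. As a reminder of what these summarize, the sketch below assumes the usual definitions: ABO (average best overlap) for a class is the mean, over that class's ground-truth boxes, of the best IOU achieved by any proposal, and MABO is the mean of ABO over classes. The input format and the sample values are hypothetical.

```python
from collections import defaultdict


def abo_per_class(best_overlaps):
    """Average best overlap for each class.

    best_overlaps: iterable of (class_name, best_iou), one entry per
    ground-truth box, where best_iou is the highest IOU between that
    box and any proposal (assumed to be precomputed).
    """
    by_class = defaultdict(list)
    for class_name, best_iou in best_overlaps:
        by_class[class_name].append(best_iou)
    return {name: sum(vals) / len(vals) for name, vals in by_class.items()}


def mabo(best_overlaps):
    """Mean of the per-class ABO values."""
    abo = abo_per_class(best_overlaps)
    return sum(abo.values()) / len(abo)


# Made-up best overlaps for three ground-truth boxes.
sample = [("cat", 0.8), ("cat", 0.6), ("dog", 0.5)]
per_class = abo_per_class(sample)
overall = mabo(sample)
```

Because MABO averages over classes rather than over boxes, a rare class weighs as much as a frequent one.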



















Object/Thing Classes in PASCAL Context Dataset

accordion             candleholder          drainer             funnel
aeroplane             cap                   dray                furnace
airconditioner        car                   drinkdispenser      gamecontroller
antenna               card                  drinkingmachine     gamemachine
ashtray               cart                  drop                gascylinder
babycarriage          case                  drug                gashood
bag                   casetterecorder       drum                gasstove
ball                  cashregister          drumkit             giftbox
balloon               cat                   duck                glass
barrel                cd                    dumbbell            glassmarble
baseballbat           cdplayer              earphone            globe
basket                cellphone             earrings            glove
basketballbackboard   cello                 egg                 gravestone
bathtub               chain                 electricfan         guitar
bed                   chair                 electriciron        gun
beer                  chessboard            electricpot         hammer
bell                  chicken               electricsaw         handcart
bench                 chopstick             electronickeyboard  handle
bicycle               clip                  engine              hanger
binoculars            clippers              envelope            harddiskdrive
bird                  clock                 equipment           hat
birdcage              closet                extinguisher        headphone
birdfeeder            cloth                 eyeglass            heater
birdnest              coffee                fan                 helicopter
blackboard            coffeemachine         faucet              helmet
board                 comb                  faxmachine          holder
boat                  computer              ferriswheel         hook
bone                  cone                  fireextinguisher    horse
book                  container             firehydrant         horse-drawncarriage
bottle                controller            fireplace           hot-airballoon
bottleopener          cooker                fish                hydrovalve
bowl                  copyingmachine        fishtank            inflatorpump
box                   cork                  fishbowl            ipod
bracelet              corkscrew             fishingnet          iron
brick                 cow                   fishingpole         ironingboard
broom                 crabstick             flag                jar
brush                 crane                 flagstaff           kart
bucket                crate                 flashlight          kettle
bus                   cross                 flower              key
cabinet               crutch                fly                 keyboard
cabinetdoor           cup                   food                kite
cage                  curtain               forceps             knife
cake                  cushion               fork                knifeblock
calculator            cuttingboard          forklift            ladder
calendar              disc                  fountain            laddertruck
camel                 disccase              fox                 ladle
camera                dishwasher            frame               laptop
cameralens            dog                   fridge              lid
can                   dolphin               frog                lifebuoy
candle                door                  fruit               light
lightbulb             pillar                sheep               tire
lighter               pillow                shell               toaster
line                  pipe                  shoe                toilet
lion                  pitcher               shoppingcart        tong
lobster               plant                 shovel              tool
lock                  plate                 sidecar             toothbrush
machine               player                sign                towel
mailbox               pliers                signallight         toy
mannequin             plume                 sink                toycar
map                   poker                 skateboard          train
mask                  pokerchip             ski                 trampoline
mat                   pole                  sled                trashbin
matchbook             pooltable             slippers            tray
mattress              postcard              snail               tricycle
menu                  poster                snake               tripod
meterbox              pot                   snowmobiles         trophy
microphone            pottedplant           sofa                truck
microwave             printer               spanner             tube
mirror                projector             spatula             turtle
missile               pumpkin               speaker             tvmonitor
model                 rabbit                spicecontainer      tweezers
money                 racket                spoon               typewriter
monkey                radiator              sprayer             umbrella
mop                   radio                 squirrel            vacuumcleaner
motorbike             rake                  stapler             vendingmachine
mouse                 ramp                  stick               videocamera
mousepad              rangehood             stickynote          videogameconsole
musicalinstrument     receiver              stone               videoplayer
napkin                recorder              stool               videotape
net                   recreationalmachines  stove               violin
newspaper             remotecontrol         straw               wakeboard
oar                   robot                 stretcher           wallet
ornament              rock                  sun                 wardrobe
oven                  rocket                sunglass            washingmachine
oxygenbottle          rockinghorse          sunshade            watch
pack                  rope                  surveillancecamera  waterdispenser
pan                   rug                   swan                waterpipe
paper                 ruler                 sweeper             waterskateboard
paperbox              saddle                swimring            watermelon
papercutter           saw                   swing               whale
parachute             scale                 switch              wheel
parasol               scanner               table               wheelchair
pen                   scissors              tableware           window
pencontainer          scoop                 tank                windowblinds
pencil                screen                tap                 wineglass
person                screwdriver           tape                wire
photo                 sculpture             tarp
piano                 scythe                telephone
picture               sewer                 telephonebooth
pig                   sewingmachine         tent

Table 2: Object/Thing Classes in PASCAL Context


defined in MS COCO dataset for classifying individual
classes in groups of similar objects.)

Now, we present the plots of recall for all 80 (20 PASCAL
+ 60 non-PASCAL) categories for the modified PASCAL
Context dataset and MS COCO. Note that the 60 non-PASCAL
categories differ between the two datasets.

Trends: Fig. 10 shows the performance of different proposal
methods and DMPs along each of these dimensions. In Fig. 10a,
we see that recall steadily improves with object size; perhaps
as expected, bigger objects are typically easier to find than
smaller objects. In Fig. 10b, we see that recall generally
increases as the number of instances increases, except for one
outlier category. This category was found to be ‘pole’, which
appears to be quite difficult to recall. Since poles are often
occluded and have a long, elongated shape, it is not surprising
that this number is pretty low. Finally, in Fig. 10c we observe
that some super-categories (e.g. outdoor objects) are hard to
recall while others (e.g. animal, electronics) are relatively
easier to recall. As can be seen in Fig. 11, the trends on MS
COCO are very similar to those on PASCAL Context.
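
The per-category analysis in this section can be sketched as follows: collect, for each category, its frequency, average size, and recall, then sort or group categories along one of those dimensions. The record format, function names, super-category mapping, and sample data below are all hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import defaultdict


def per_category_stats(instances):
    """instances: iterable of (category, area_fraction, was_recalled)."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # count, summed area, hits
    for category, area_fraction, was_recalled in instances:
        entry = totals[category]
        entry[0] += 1
        entry[1] += area_fraction
        entry[2] += int(was_recalled)
    return {
        category: {"frequency": count,
                   "avg_size": area / count,
                   "recall": hits / count}
        for category, (count, area, hits) in totals.items()
    }


def ranked(stats, key):
    """Category names sorted in increasing order of the given statistic."""
    return sorted(stats, key=lambda category: stats[category][key])


def recall_by_group(stats, group_of):
    """Mean per-category recall within each group (e.g. a super-category)."""
    grouped = defaultdict(list)
    for category, values in stats.items():
        grouped[group_of[category]].append(values["recall"])
    return {group: sum(vals) / len(vals) for group, vals in grouped.items()}


# Made-up ground-truth instances (not data from the paper).
instances = [
    ("pole", 0.01, False), ("pole", 0.02, False), ("pole", 0.03, True),
    ("dog", 0.30, True), ("dog", 0.20, True),
    ("cat", 0.40, True),
]
stats = per_category_stats(instances)
by_size = ranked(stats, "avg_size")
by_frequency = ranked(stats, "frequency")
groups = recall_by_group(
    stats, {"pole": "outdoor", "dog": "animal", "cat": "animal"})
```

Plotting recall in the order given by `by_size` or `by_frequency`, or per entry of `groups`, corresponds to the three views described above.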

8.5. Change Log 

This section tracks major changes in the paper. 

v1: Initial version.

v2, v3: Minor modifications in text.

v4: Current version (more details in section 6.1). 
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